92 research outputs found
Une approche par boosting à la sélection de modÚles pour l'analyse syntaxique statistique (A boosting approach to model selection for statistical parsing)
In this work we present our approach to model selection for statistical parsing via boosting. The method targets the inefficiency of current feature selection methods: it allows a constant feature selection time at each iteration, rather than the increasing selection time of standard forward wrapper methods. With the aim of performing feature selection on very high-dimensional data, in particular for parsing morphologically rich languages, we test the approach, which uses the multiclass AdaBoost algorithm SAMME (Zhu et al., 2006), on French data from the French Treebank, using a multilingual discriminative constituency parser (Crabbé, 2014). Current results show that the method is indeed far more efficient than a naïve method, and the performance of the models produced is promising, with F-scores comparable to carefully selected manual models. We provide some perspectives for improving on these performances in future work.
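The SAMME algorithm named above differs from binary AdaBoost only by an extra log(K-1) term in the learner weight, which is what makes it usable with more than two classes. The following is a minimal sketch over decision stumps, assuming scikit-learn is available; the function names are illustrative, and this generic boosting loop is not the parser-specific feature-selection wrapper the paper builds on top of it:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def samme_fit(X, y, n_rounds=10):
    """Minimal SAMME (multiclass AdaBoost) over depth-1 decision stumps."""
    n, K = len(y), len(np.unique(y))
    w = np.full(n, 1.0 / n)                      # uniform example weights
    learners, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        miss = stump.predict(X) != y
        err = w[miss].sum() / w.sum()
        if err >= 1.0 - 1.0 / K:                 # no better than random guessing
            break
        # SAMME learner weight: binary AdaBoost plus a log(K - 1) correction
        alpha = np.log((1 - err) / max(err, 1e-12)) + np.log(K - 1)
        w *= np.exp(alpha * miss)                # up-weight misclassified examples
        w /= w.sum()
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def samme_predict(X, learners, alphas, K):
    """Weighted vote over the boosted weak learners."""
    votes = np.zeros((len(X), K))
    for stump, a in zip(learners, alphas):
        votes[np.arange(len(X)), stump.predict(X)] += a
    return votes.argmax(axis=1)
```

In the paper's setting, each boosting round instead evaluates candidate feature templates for the parser, which is what keeps the per-iteration selection cost constant rather than growing as in forward wrapper methods.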
Document Sub-structure in Neural Machine Translation
Current approaches to machine translation (MT) either translate sentences in
isolation, disregarding the context they appear in, or model context at the
level of the full document, without a notion of any internal structure the
document may have. In this work we consider the fact that documents are rarely
homogeneous blocks of text, but rather consist of parts covering different
topics. Some documents, such as biographies and encyclopedia entries, have
highly predictable, regular structures in which sections are characterised by
different topics. We draw inspiration from Louis and Webber (2014) who use this
information to improve statistical MT and transfer their proposal into the
framework of neural MT. We compare two different methods of including
information about the topic of the section within which each sentence is found:
one using side constraints and the other using a cache-based model. We create
and release the data on which we run our experiments - parallel corpora for
three language pairs (Chinese-English, French-English, Bulgarian-English) from
Wikipedia biographies, which we extract automatically, preserving the
boundaries of sections within the articles.
Comment: Accepted at LREC 202
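The side-constraint method can be sketched as simply prepending a pseudo-token encoding the section topic to each source sentence, so the encoder can condition on it without any architectural change; the tag format and function name below are illustrative, not the paper's actual vocabulary:

```python
def add_topic_constraint(source_tokens, topic):
    """Prepend a side-constraint pseudo-token encoding the section topic.

    The NMT model treats the tag as an ordinary vocabulary item, so no
    change to the architecture is needed; during training the model learns
    to associate the tag with topic-specific target-side vocabulary.
    """
    return [f"<topic:{topic}>"] + list(source_tokens)
```

The cache-based alternative mentioned above instead maintains a store of topic-typical target words that biases the decoder's output distribution at each step.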
Few-shot learning through contextual data augmentation
Machine translation (MT) models used in industries with constantly changing
topics, such as translation or news agencies, need to adapt to new data to
maintain their performance over time. Our aim is to teach a pre-trained MT
model to translate previously unseen words accurately, based on very few
examples. We propose (i) an experimental setup allowing us to simulate novel
vocabulary appearing in human-submitted translations, and (ii) corresponding
evaluation metrics to compare our approaches. We extend a data augmentation
approach using a pre-trained language model to create training examples with
similar contexts for novel words. We compare different fine-tuning and data
augmentation approaches and show that adaptation on the scale of one to five
examples is possible. Combining data augmentation with randomly selected
training sentences leads to the highest BLEU score and accuracy improvements.
Impressively, with only 1 to 5 examples, our model reports better accuracy
scores than a reference system trained with on average 313 parallel examples.
Comment: 14 pages including 3 of appendices
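A much-simplified stand-in for this augmentation idea is to substitute the novel word into existing training sentences that contain a distributionally similar word; in the paper the similar contexts come from a pre-trained language model rather than this naive substitution, and all names below are illustrative:

```python
import random

def augment_novel_word(novel_word, similar_word, corpus, n=5):
    """Create synthetic training sentences for `novel_word` by swapping it
    into sentences that contain a similar word (a simplified stand-in for
    LM-generated contexts)."""
    hits = [s for s in corpus if similar_word in s.split()]
    sampled = random.sample(hits, min(n, len(hits)))
    return [" ".join(novel_word if tok == similar_word else tok
                     for tok in s.split())
            for s in sampled]
```

Mixing such synthetic examples with randomly selected original training sentences mirrors the combination that the abstract reports as giving the largest BLEU and accuracy gains.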
Boosting for Efficient Model Selection for Syntactic Parsing
We present an efficient model selection method using boosting for transition-based constituency parsing. It is designed for exploring a high-dimensional search space defined by a large set of feature templates, as is typically the case when parsing morphologically rich languages. Our method removes the need to manually define heuristic constraints, which are often imposed in current state-of-the-art selection methods. Our experiments on French show that the method is more efficient and is also capable of producing compact, state-of-the-art models.
A Study in Improving BLEU Reference Coverage with Diverse Automatic Paraphrasing
We investigate a long-perceived shortcoming in the typical use of BLEU: its
reliance on a single reference. Using modern neural paraphrasing techniques, we
study whether automatically generating additional diverse references can
provide better coverage of the space of valid translations and thereby improve
its correlation with human judgments. Our experiments on the into-English
language directions of the WMT19 metrics task (at both the system and sentence
level) show that using paraphrased references does generally improve BLEU, and
when it does, the more diverse the better. However, we also show that better
results could be achieved if those paraphrases were to specifically target the
parts of the space most relevant to the MT outputs being evaluated. Moreover,
the gains remain slight even when human paraphrases are used, suggesting
inherent limitations to BLEU's capacity to correctly exploit multiple
references. Surprisingly, we also find that adequacy appears to be less
important, as shown by the high results of a strong sampling approach, which
even beats human paraphrases when used with sentence-level BLEU.
Comment: Accepted in the Findings of EMNLP 202
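With multiple references, BLEU clips each hypothesis n-gram count against the element-wise maximum over all references, and the brevity penalty uses the reference length closest to the hypothesis. A minimal sentence-level sketch with add-one smoothing follows; the evaluations described above use standard tooling, so treat this as an illustration of the mechanism only:

```python
import math
from collections import Counter

def _ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def multi_ref_bleu(hyp, refs, max_n=4):
    """Sentence-level BLEU against multiple references (add-one smoothed)."""
    precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = _ngrams(hyp, n)
        if not hyp_counts:               # hypothesis shorter than n tokens
            return 0.0
        clip = Counter()                 # element-wise max over references
        for ref in refs:
            for g, c in _ngrams(ref, n).items():
                clip[g] = max(clip[g], c)
        matched = sum(min(c, clip[g]) for g, c in hyp_counts.items())
        precisions.append((matched + 1) / (sum(hyp_counts.values()) + 1))
    # brevity penalty: use the reference length closest to the hypothesis
    ref_len = min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
    bp = 1.0 if len(hyp) >= ref_len else math.exp(1 - ref_len / len(hyp))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

Adding a reference can only grow the clipped counts, so extra paraphrased references never lower the score; the question the paper studies is whether they raise it in the ways that matter for correlation with human judgments.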
- 
